- 
                Notifications
    
You must be signed in to change notification settings  - Fork 25.6k
 
Clamp exponential histogram percentiles to min/max #135632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clamp exponential histogram percentiles to min/max #135632
Conversation
| 
           Pinging @elastic/es-storage-engine (Team:StorageEngine)  | 
    
| result = values.valueAtPreviousRank() * (1 - upperFactor) + values.valueAtRank() * upperFactor; | ||
| } | ||
| return removeNegativeZero(result); | ||
| return removeNegativeZero(Math.clamp(result, histo.min(), histo.max())); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm percentiles shouldn't return values outside min and max.. Should we be doing interpolation between min/max and the borderline ranks instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably have hardcoded logic for p0 and p100, now that you have min and max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To rephrase what this PR tries to solve:
- Lets assume in our histogram the highest, populated bucket is 
[1,2]. - The percentile algorithm assumes that all values which fell into the bucket have the value 
1.3333(point of least relative error). Therefore if a percentile falling into that bucket is requested,1.3333would be returned - However, if the 
maxof the histogram is actually1.1, we know that1.333is incorrect. The[1,2]bucket was populated with values in the range[1, 1.1]. 
So what this PR does is that if the percentile falls into the highest (or lowest) bucket, it adjusts the assumed value for that bucket to move inside of min and max respectively.
If the percentile we are estimating does not lie in the outermost buckets (the ones containing min and max), the clamping has no effect: The estimated bucket center is bigger than min and smaller than max anyway.
Therefore I don't understand what (a) the interpolation you are suggesting would do and (b) why we should have a hardcoded logic for p0 and p100, as those are covered by the existing logic correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see. If max is 1.1, i.e. less than the polre :P, the polre should not be used for the highest bucket. Instead, we should be interpolating between the polre of the second-highest bucket and the max value. Using the polre for the highest bucket is provably inaccurate, in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what you mean now:
You are referring to the case where the percentile lies between the second highest and the highest bucket, and therefore is interpolated, right?
That means that it is better to clamp the ValueAndPreviousValue values before we do the interpolation, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, wanna give it a try and add some tests to see what you get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 95166d9, which also adds a test which failed with the previous behaviour.
Before this PR, the exponential histogram percentiles algorithm would return the midpoint of the bucket for percentile queries.
With the exact
minandmaxknown and stored with the histogram, this can lead to the confusing case where e.g. ap99is estimated to be larger than themaxof the histogram.This PR avoids this problem by clamping estimated percentiles to the
[min, max]range.If the histogram was constructed without the exact
minandmaxbeing known, those are estimated using the lower/upper bound of the outermost buckets. Therefore in that case, clamping will not have an effect on the percentile computation.